Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Sensorless Drive Diagnosis dataset presents a multi-class classification problem in which we are trying to predict one of several possible outcomes.
INTRODUCTION: The dataset contains features extracted from electric current drive signals. The drive has both intact and defective components, and the signals fall into 11 classes representing different conditions. Each condition was measured several times under 12 different operating conditions, such as speeds, load moments, and load forces.
In iteration Take1, we established the baseline accuracy measurement for comparison with future rounds of modeling.
In this iteration, we will standardize the numeric attributes and observe the impact of scaling on modeling accuracy.
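Standardization rescales each numeric attribute to zero mean and unit variance, z = (x - mean(x)) / sd(x). The following minimal sketch illustrates the effect using caret's preProcess on a made-up single-column data frame (the column name attr1 is only a placeholder borrowed from this dataset's naming scheme):

```r
library(caret)

# A tiny made-up attribute column to demonstrate standardization
df <- data.frame(attr1 = c(10, 20, 30, 40, 50))

# Learn the centering/scaling parameters, then apply them
pp <- preProcess(df, method = c("center", "scale"))
df_scaled <- predict(pp, df)

round(mean(df_scaled$attr1), 10)  # effectively 0 after centering
sd(df_scaled$attr1)               # 1 after scaling
```

The same preProcess/predict pattern is applied to the full attribute set later in the script.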
ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.53%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was even better than the prediction from the training data.
In this iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.34%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was even better than the prediction from the training data.
By standardizing the dataset features, the ensemble algorithms continued to perform well. However, standardizing the features appeared to have little impact on the overall modeling accuracy.
CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, Random Forest could be considered for further modeling.
Dataset Used: Sensorless Drive Diagnosis Data Set
Dataset ML Model: Multi-class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis
The project aims to touch on the following areas:
Any predictive modeling machine learning project generally can be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(stringr)
# Create the random seed number for reproducible results
seedNum <- 888
# Set up the notifyStatus flag to control progress emails (setting it to TRUE will send status emails!)
notifyStatus <- TRUE
if (notifyStatus) library(mailR)
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
# Set up the email notification function
email_notify <- function(msg="") {
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Multi-Class Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject = sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3b6eb2ec}"
# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00325/Sensorless_drive_diagnosis.txt'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]
if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
  # unzip(dest_file)
  # cat(dest_file, "unpacked!\n")
}
inputFile <- dest_file
colNames <- paste0("attr",1:48)
colNames <- c(colNames, 'targetVar')
Xy_original <- read.csv(inputFile, sep=' ', header=FALSE, col.names = colNames)
# Take a peek at the dataframe after the import
head(Xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6
## 1 -3.0146e-07 8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2 2.9132e-06 -5.2477e-06 3.3421e-06 -6.0561e-06 2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06 1.7394e-05
## 4 -1.3226e-06 8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07 4.1439e-06
## 5 -6.8366e-08 5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07 1.3491e-05
## 6 -9.5849e-07 5.2143e-08 -4.7359e-05 6.4537e-07 -2.3041e-06 5.4999e-05
## attr7 attr8 attr9 attr10 attr11 attr12 attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
## attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
## attr21 attr22 attr23 attr24 attr25 attr26 attr27
## 1 0.89669 0.89658 0.89658 0.89656 0.00768040 0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621 0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572 0.00205630 0.466570 0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572 0.00048305 0.164030 -0.13124
## attr28 attr29 attr30 attr31 attr32 attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140 0.119110 0.31117 0.0010932 0.0010911 0.0010682
## 3 0.00044468 -0.162300 0.56210 0.0028942 0.0029030 0.0028851
## 4 0.00693590 -0.467240 0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650 0.343380 0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862 2.21850 -0.0028981 -0.0028984 -0.0028680
## attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732 4.3662 6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404 1.3977 3.6048 -0.59314
## 3 0.00035014 0.00035803 0.00037366 -0.67146 2.8072 5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766 7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720 5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298 7.3162 3.9757 -0.61124
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## 1 2.9646 8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996 1
## 2 7.6252 6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005 1
## 3 2.7784 5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985 1
## 4 6.5534 6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976 1
## 5 4.5155 9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959 1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973 1
sapply(Xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer"
sapply(Xy_original, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## 0 0 0 0 0 0 0
# Convert columns from one data type to another
Xy_original$targetVar <- as.factor(Xy_original$targetVar)
# Take a peek at the dataframe after the cleaning
head(Xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6
## 1 -3.0146e-07 8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2 2.9132e-06 -5.2477e-06 3.3421e-06 -6.0561e-06 2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06 1.7394e-05
## 4 -1.3226e-06 8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07 4.1439e-06
## 5 -6.8366e-08 5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07 1.3491e-05
## 6 -9.5849e-07 5.2143e-08 -4.7359e-05 6.4537e-07 -2.3041e-06 5.4999e-05
## attr7 attr8 attr9 attr10 attr11 attr12 attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
## attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
## attr21 attr22 attr23 attr24 attr25 attr26 attr27
## 1 0.89669 0.89658 0.89658 0.89656 0.00768040 0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621 0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572 0.00205630 0.466570 0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572 0.00048305 0.164030 -0.13124
## attr28 attr29 attr30 attr31 attr32 attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140 0.119110 0.31117 0.0010932 0.0010911 0.0010682
## 3 0.00044468 -0.162300 0.56210 0.0028942 0.0029030 0.0028851
## 4 0.00693590 -0.467240 0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650 0.343380 0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862 2.21850 -0.0028981 -0.0028984 -0.0028680
## attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732 4.3662 6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404 1.3977 3.6048 -0.59314
## 3 0.00035014 0.00035803 0.00037366 -0.67146 2.8072 5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766 7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720 5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298 7.3162 3.9757 -0.61124
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## 1 2.9646 8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996 1
## 2 7.6252 6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005 1
## 3 2.7784 5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985 1
## 4 6.5534 6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976 1
## 5 4.5155 9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959 1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973 1
sapply(Xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## 0 0 0 0 0 0 0
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol != 1) and (targetCol != totCol), be aware when slicing up the dataframes for visualization!
targetCol <- totCol
# Standardize the class column to the name of targetVar if applicable
# colnames(Xy_original)[targetCol] <- "targetVar"
# We create attribute-only and target-only datasets (X_original and y_original)
# for various visualization and cleaning/transformation operations
if (targetCol==1) {
  X_original <- Xy_original[,(targetCol+1):totCol]
  y_original <- Xy_original[,targetCol]
} else {
  X_original <- Xy_original[,1:totAttr]
  y_original <- Xy_original[,totCol]
}
dim(Xy_original)
## [1] 58509 49
dim(X_original)
## [1] 58509 48
# Set up the number of rows and columns for the visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 3
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 3 by 16
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@30c7da1e}"
To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2812cbfa}"
head(Xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6
## 1 -3.0146e-07 8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2 2.9132e-06 -5.2477e-06 3.3421e-06 -6.0561e-06 2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06 1.7394e-05
## 4 -1.3226e-06 8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07 4.1439e-06
## 5 -6.8366e-08 5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07 1.3491e-05
## 6 -9.5849e-07 5.2143e-08 -4.7359e-05 6.4537e-07 -2.3041e-06 5.4999e-05
## attr7 attr8 attr9 attr10 attr11 attr12 attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
## attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
## attr21 attr22 attr23 attr24 attr25 attr26 attr27
## 1 0.89669 0.89658 0.89658 0.89656 0.00768040 0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621 0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572 0.00205630 0.466570 0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572 0.00048305 0.164030 -0.13124
## attr28 attr29 attr30 attr31 attr32 attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140 0.119110 0.31117 0.0010932 0.0010911 0.0010682
## 3 0.00044468 -0.162300 0.56210 0.0028942 0.0029030 0.0028851
## 4 0.00693590 -0.467240 0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650 0.343380 0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862 2.21850 -0.0028981 -0.0028984 -0.0028680
## attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732 4.3662 6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404 1.3977 3.6048 -0.59314
## 3 0.00035014 0.00035803 0.00037366 -0.67146 2.8072 5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766 7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720 5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298 7.3162 3.9757 -0.61124
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## 1 2.9646 8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996 1
## 2 7.6252 6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005 1
## 3 2.7784 5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985 1
## 4 6.5534 6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976 1
## 5 4.5155 9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959 1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973 1
dim(Xy_original)
## [1] 58509 49
sapply(Xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 targetVar
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
summary(Xy_original)
## attr1 attr2 attr3
## Min. :-1.372e-02 Min. :-5.414e-03 Min. :-1.358e-02
## 1st Qu.:-7.431e-06 1st Qu.:-1.444e-05 1st Qu.:-7.240e-05
## Median :-2.653e-06 Median : 8.800e-07 Median : 5.140e-07
## Mean :-3.333e-06 Mean : 1.440e-06 Mean : 1.412e-06
## 3rd Qu.: 1.571e-06 3rd Qu.: 1.878e-05 3rd Qu.: 7.520e-05
## Max. : 5.784e-03 Max. : 4.525e-03 Max. : 5.238e-03
##
## attr4 attr5 attr6
## Min. :-1.279e-02 Min. :-8.356e-03 Min. :-9.741e-03
## 1st Qu.:-5.418e-06 1st Qu.:-1.475e-05 1st Qu.:-7.379e-05
## Median :-1.059e-06 Median : 7.540e-07 Median :-1.660e-07
## Mean :-1.313e-06 Mean : 1.351e-06 Mean :-2.650e-07
## 3rd Qu.: 3.555e-06 3rd Qu.: 1.906e-05 3rd Qu.: 7.139e-05
## Max. : 1.453e-03 Max. : 8.245e-04 Max. : 2.754e-03
##
## attr7 attr8 attr9
## Min. :-0.139890 Min. :-0.135940 Min. :-0.130860
## 1st Qu.:-0.019927 1st Qu.:-0.019951 1st Qu.:-0.019925
## Median : 0.013226 Median : 0.013230 Median : 0.013247
## Mean : 0.001915 Mean : 0.001913 Mean : 0.001912
## 3rd Qu.: 0.024770 3rd Qu.: 0.024776 3rd Qu.: 0.024777
## Max. : 0.069125 Max. : 0.069130 Max. : 0.069131
##
## attr10 attr11 attr12
## Min. :-0.21864 Min. :-0.21860 Min. :-0.21863
## 1st Qu.:-0.03214 1st Qu.:-0.03216 1st Qu.:-0.03217
## Median :-0.01557 Median :-0.01559 Median :-0.01560
## Mean :-0.01190 Mean :-0.01190 Mean :-0.01190
## 3rd Qu.: 0.02061 3rd Qu.: 0.02062 3rd Qu.: 0.02060
## Max. : 0.35258 Max. : 0.35256 Max. : 0.35263
##
## attr13 attr14 attr15
## Min. :0.0007509 Min. :0.0001884 Min. :0.0003542
## 1st Qu.:0.0011368 1st Qu.:0.0005992 1st Qu.:0.0012566
## Median :0.0021989 Median :0.0011845 Median :0.0029800
## Mean :0.0018763 Mean :0.0010834 Mean :0.0030917
## 3rd Qu.:0.0025265 3rd Qu.:0.0014563 3rd Qu.:0.0043361
## Max. :0.1365700 Max. :0.0515430 Max. :0.1039300
##
## attr16 attr17 attr18
## Min. :0.0007445 Min. :0.0001889 Min. :0.000357
## 1st Qu.:0.0011394 1st Qu.:0.0005981 1st Qu.:0.001288
## Median :0.0021878 Median :0.0011820 Median :0.002891
## Mean :0.0018665 Mean :0.0010775 Mean :0.003076
## 3rd Qu.:0.0025230 3rd Qu.:0.0014538 3rd Qu.:0.004322
## Max. :0.1087700 Max. :0.0647640 Max. :0.078530
##
## attr19 attr20 attr21 attr22
## Min. :0.7976 Min. :0.7976 Min. :0.7976 Min. :0.7984
## 1st Qu.:1.3274 1st Qu.:1.3274 1st Qu.:1.3267 1st Qu.:1.3287
## Median :1.5732 Median :1.5731 Median :1.5729 Median :1.5726
## Mean :1.6183 Mean :1.6183 Mean :1.6178 Mean :1.6178
## 3rd Qu.:1.8858 3rd Qu.:1.8857 3rd Qu.:1.8849 3rd Qu.:1.8834
## Max. :2.3770 Max. :2.3769 Max. :2.3758 Max. :2.3728
##
## attr23 attr24 attr25
## Min. :0.7984 Min. :0.7984 Min. :-15.796000
## 1st Qu.:1.3287 1st Qu.:1.3281 1st Qu.: -0.006033
## Median :1.5725 Median :1.5724 Median : 0.003020
## Mean :1.6177 Mean :1.6173 Mean : 0.001909
## 3rd Qu.:1.8833 3rd Qu.:1.8825 3rd Qu.: 0.011576
## Max. :2.3726 Max. :2.3715 Max. : 28.285000
##
## attr26 attr27 attr28
## Min. :-12.351000 Min. :-7.959000 Min. :-11.903000
## 1st Qu.: -0.205900 1st Qu.:-0.453440 1st Qu.: -0.009230
## Median : 0.006513 Median :-0.000126 Median : 0.000168
## Mean : 0.008799 Mean :-0.003465 Mean : -0.000157
## 3rd Qu.: 0.220960 3rd Qu.: 0.445910 3rd Qu.: 0.008671
## Max. : 12.437000 Max. : 9.580300 Max. : 18.294000
##
## attr29 attr30 attr31
## Min. :-12.508000 Min. :-9.976600 Min. :-5.024e-02
## 1st Qu.: -0.203390 1st Qu.:-0.448040 1st Qu.:-5.102e-03
## Median : 0.008109 Median :-0.004195 Median : 4.520e-04
## Mean : 0.012089 Mean :-0.009958 Mean : 1.628e-05
## 3rd Qu.: 0.225560 3rd Qu.: 0.429030 3rd Qu.: 5.165e-03
## Max. : 10.977000 Max. : 8.764000 Max. : 8.638e-02
##
## attr32 attr33 attr34
## Min. :-0.0518910 Min. :-5.279e-02 Min. :-0.3377100
## 1st Qu.:-0.0051129 1st Qu.:-5.109e-03 1st Qu.:-0.0045218
## Median : 0.0004506 Median : 4.612e-04 Median :-0.0002775
## Mean : 0.0000142 Mean : 1.922e-05 Mean :-0.0000347
## 3rd Qu.: 0.0051651 3rd Qu.: 5.174e-03 3rd Qu.: 0.0049597
## Max. : 0.0864570 Max. : 8.655e-02 Max. : 0.1948200
##
## attr35 attr36 attr37
## Min. :-0.3377000 Min. :-0.3377500 Min. : -0.912
## 1st Qu.:-0.0045180 1st Qu.:-0.0044891 1st Qu.: -0.715
## Median :-0.0002735 Median :-0.0002740 Median : -0.664
## Mean :-0.0000376 Mean :-0.0000316 Mean : -0.463
## 3rd Qu.: 0.0049553 3rd Qu.: 0.0049666 3rd Qu.: -0.582
## Max. : 0.1902000 Max. : 0.1850300 Max. :4015.400
##
## attr38 attr39 attr40
## Min. : -0.618 Min. : 0.5222 Min. : -0.902
## 1st Qu.: 1.485 1st Qu.: 4.4513 1st Qu.: -0.715
## Median : 3.300 Median : 6.5668 Median : -0.662
## Mean : 7.447 Mean : 8.4068 Mean : -0.398
## 3rd Qu.: 8.373 3rd Qu.: 9.9526 3rd Qu.: -0.574
## Max. :312.520 Max. :265.3300 Max. :3670.800
##
## attr41 attr42 attr43 attr44
## Min. : -0.5968 Min. : 0.3207 Min. :-1.526 Min. :-1.526
## 1st Qu.: 1.4503 1st Qu.: 4.4363 1st Qu.:-1.503 1st Qu.:-1.503
## Median : 3.3013 Median : 6.4791 Median :-1.500 Median :-1.500
## Mean : 7.2938 Mean : 8.2738 Mean :-1.501 Mean :-1.501
## 3rd Qu.: 8.2885 3rd Qu.: 9.8575 3rd Qu.:-1.498 3rd Qu.:-1.498
## Max. :889.9300 Max. :153.1500 Max. :-1.458 Max. :-1.456
##
## attr45 attr46 attr47 attr48
## Min. :-1.524 Min. :-1.521 Min. :-1.523 Min. :-1.521
## 1st Qu.:-1.503 1st Qu.:-1.500 1st Qu.:-1.500 1st Qu.:-1.500
## Median :-1.500 Median :-1.498 Median :-1.498 Median :-1.498
## Mean :-1.501 Mean :-1.498 Mean :-1.498 Mean :-1.498
## 3rd Qu.:-1.498 3rd Qu.:-1.496 3rd Qu.:-1.496 3rd Qu.:-1.496
## Max. :-1.456 Max. :-1.337 Max. :-1.337 Max. :-1.337
##
## targetVar
## 1 : 5319
## 2 : 5319
## 3 : 5319
## 4 : 5319
## 5 : 5319
## 6 : 5319
## (Other):26595
cbind(freq=table(y_original), percentage=prop.table(table(y_original))*100)
## freq percentage
## 1 5319 9.090909
## 2 5319 9.090909
## 3 5319 9.090909
## 4 5319 9.090909
## 5 5319 9.090909
## 6 5319 9.090909
## 7 5319 9.090909
## 8 5319 9.090909
## 9 5319 9.090909
## 10 5319 9.090909
## 11 5319 9.090909
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for (i in 1:totAttr) {
  boxplot(X_original[,i], main=names(X_original)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for (i in 1:totAttr) {
  hist(X_original[,i], main=names(X_original)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for (i in 1:totAttr) {
  plot(density(X_original[,i]), main=names(X_original)[i])
}
# Correlation matrix
correlations <- cor(X_original)
corrplot(correlations, method="circle")
if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2f7a2457}"
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5ebec15}"
# Apply feature scaling techniques
preProcValues <- preProcess(X_original, method = c("center", "scale", "YeoJohnson"))
X_transformed <- predict(preProcValues, X_original)
Xy_original <- cbind(X_transformed, y_original)
colnames(Xy_original)[totCol] <- "targetVar"
# Histograms for each attribute after the transformation
for (i in 1:totAttr) {
  hist(X_transformed[,i], main=names(X_transformed)[i])
}
# Create the training and testing sub-datasets for the modeling activities.
set.seed(seedNum)
# Use 75% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.75, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]
if (targetCol==1) {
  y_test <- Xy_test[,targetCol]
} else {
  y_test <- Xy_test[,totCol]
}
# Not applicable for this iteration of the project
# Finalize the training and testing datasets for the modeling activities
dim(Xy_train)
## [1] 43890 49
dim(Xy_test)
## [1] 14619 49
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@38082d64}"
proc.time()-startTimeScript
## user system elapsed
## 59.849 1.040 70.108
After the data preparation, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include:
For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:
Linear Algorithm: Linear Discriminant Analysis
Non-Linear Algorithm: Decision Trees (CART)
Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting
The random number seed is reset before each run to ensure that the evaluation of each algorithm is performed using the same data splits. This ensures that the results are directly comparable.
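This behavior can be checked directly with caret's createFolds in a small sketch (the seed value matches seedNum above; the factor y is a made-up stand-in for the target variable):

```r
library(caret)

# A synthetic 3-class target just for demonstration
y <- factor(rep(1:3, each = 10))

# Resetting the seed immediately before each resampling call
# reproduces the exact same cross-validation folds
set.seed(888)
folds_a <- createFolds(y, k = 10)

set.seed(888)
folds_b <- createFolds(y, k = 10)

identical(folds_a, folds_b)  # TRUE: every algorithm sees identical splits
```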
startModeling <- proc.time()
# Linear Discriminant Analysis (Classification)
if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@180bc464}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lda <- train(targetVar~., data=Xy_train, method="lda", metric=metricTarget, trControl=control)
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
print(fit.lda)
## Linear Discriminant Analysis
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results:
##
## Accuracy Kappa
## 0.845614 0.8301754
proc.time()-startTimeModule
## user system elapsed
## 9.614 1.664 8.690
if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2d554825}"
# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4909b8da}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.09927318 0.43572568 0.3792982
## 0.09994987 0.23627250 0.1598997
## 0.10000000 0.09090909 0.0000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09927318.
proc.time()-startTimeModule
## user system elapsed
## 55.467 1.018 55.413
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@54a097cc}"
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
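Before fitting them, here is a minimal sketch of the bagging idea behind Bagged CART and Random Forest: fit a learner on many bootstrap resamples and aggregate the predictions by majority vote. The "learner" below is deliberately trivial (it predicts the majority class of its resample), and the data and seed are hypothetical; the point is only the resample-and-vote mechanics.

```r
set.seed(888)  # hypothetical seed, for reproducibility of the sketch only
y <- factor(sample(c("intact", "defective"), 30, replace = TRUE,
                   prob = c(0.7, 0.3)))
votes <- replicate(25, {
  boot <- sample(y, replace = TRUE)  # bootstrap resample of the data
  names(which.max(table(boot)))      # this learner's predicted class
})
bagged_prediction <- names(which.max(table(votes)))  # majority vote
```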
# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@50f8360d}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9879471 0.9867419
proc.time()-startTimeModule
## user system elapsed
## 1107.455 25.287 1105.384
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@337d0578}"
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2669b199}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9989747 0.9988722
## 25 0.9933470 0.9926817
## 48 0.9899294 0.9889223
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 5167.401 8.427 5185.903
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3c756e4d}"
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4439f31e}"
startTimeModule <- proc.time()
set.seed(seedNum)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy
## 0.3 1 0.6 0.50 50 0.9185008
## 0.3 1 0.6 0.50 100 0.9616314
## 0.3 1 0.6 0.50 150 0.9754842
## 0.3 1 0.6 0.75 50 0.9204375
## 0.3 1 0.6 0.75 100 0.9610162
## 0.3 1 0.6 0.75 150 0.9751652
## 0.3 1 0.6 1.00 50 0.9202552
## 0.3 1 0.6 1.00 100 0.9621554
## 0.3 1 0.6 1.00 150 0.9752335
## 0.3 1 0.8 0.50 50 0.9180907
## 0.3 1 0.8 0.50 100 0.9606972
## 0.3 1 0.8 0.50 150 0.9747779
## 0.3 1 0.8 0.75 50 0.9203691
## 0.3 1 0.8 0.75 100 0.9616997
## 0.3 1 0.8 0.75 150 0.9753019
## 0.3 1 0.8 1.00 50 0.9203008
## 0.3 1 0.8 1.00 100 0.9616769
## 0.3 1 0.8 1.00 150 0.9754386
## 0.3 2 0.6 0.50 50 0.9863978
## 0.3 2 0.6 0.50 100 0.9958761
## 0.3 2 0.6 0.50 150 0.9975849
## 0.3 2 0.6 0.75 50 0.9856459
## 0.3 2 0.6 0.75 100 0.9957393
## 0.3 2 0.6 0.75 150 0.9976304
## 0.3 2 0.6 1.00 50 0.9856459
## 0.3 2 0.6 1.00 100 0.9963317
## 0.3 2 0.6 1.00 150 0.9980406
## 0.3 2 0.8 0.50 50 0.9855092
## 0.3 2 0.8 0.50 100 0.9956026
## 0.3 2 0.8 0.50 150 0.9974026
## 0.3 2 0.8 0.75 50 0.9856687
## 0.3 2 0.8 0.75 100 0.9961495
## 0.3 2 0.8 0.75 150 0.9978127
## 0.3 2 0.8 1.00 50 0.9856687
## 0.3 2 0.8 1.00 100 0.9962634
## 0.3 2 0.8 1.00 150 0.9978355
## 0.3 3 0.6 0.50 50 0.9952837
## 0.3 3 0.6 0.50 100 0.9983595
## 0.3 3 0.6 0.50 150 0.9987241
## 0.3 3 0.6 0.75 50 0.9959900
## 0.3 3 0.6 0.75 100 0.9984962
## 0.3 3 0.6 0.75 150 0.9986785
## 0.3 3 0.6 1.00 50 0.9962178
## 0.3 3 0.6 1.00 100 0.9986102
## 0.3 3 0.6 1.00 150 0.9988152
## 0.3 3 0.8 0.50 50 0.9955571
## 0.3 3 0.8 0.50 100 0.9982456
## 0.3 3 0.8 0.50 150 0.9984962
## 0.3 3 0.8 0.75 50 0.9958533
## 0.3 3 0.8 0.75 100 0.9985418
## 0.3 3 0.8 0.75 150 0.9987469
## 0.3 3 0.8 1.00 50 0.9960811
## 0.3 3 0.8 1.00 100 0.9984735
## 0.3 3 0.8 1.00 150 0.9988152
## 0.4 1 0.6 0.50 50 0.9419002
## 0.4 1 0.6 0.50 100 0.9714741
## 0.4 1 0.6 0.50 150 0.9825701
## 0.4 1 0.6 0.75 50 0.9442925
## 0.4 1 0.6 0.75 100 0.9718159
## 0.4 1 0.6 0.75 150 0.9825245
## 0.4 1 0.6 1.00 50 0.9442242
## 0.4 1 0.6 1.00 100 0.9719070
## 0.4 1 0.6 1.00 150 0.9828207
## 0.4 1 0.8 0.50 50 0.9424470
## 0.4 1 0.8 0.50 100 0.9716564
## 0.4 1 0.8 0.50 150 0.9828663
## 0.4 1 0.8 0.75 50 0.9438596
## 0.4 1 0.8 0.75 100 0.9715197
## 0.4 1 0.8 0.75 150 0.9827979
## 0.4 1 0.8 1.00 50 0.9443153
## 0.4 1 0.8 1.00 100 0.9720893
## 0.4 1 0.8 1.00 150 0.9828890
## 0.4 2 0.6 0.50 50 0.9919116
## 0.4 2 0.6 0.50 100 0.9972887
## 0.4 2 0.6 0.50 150 0.9979722
## 0.4 2 0.6 0.75 50 0.9917293
## 0.4 2 0.6 0.75 100 0.9975621
## 0.4 2 0.6 0.75 150 0.9983823
## 0.4 2 0.6 1.00 50 0.9923217
## 0.4 2 0.6 1.00 100 0.9976988
## 0.4 2 0.6 1.00 150 0.9984507
## 0.4 2 0.8 0.50 50 0.9917749
## 0.4 2 0.8 0.50 100 0.9971748
## 0.4 2 0.8 0.50 150 0.9981545
## 0.4 2 0.8 0.75 50 0.9916838
## 0.4 2 0.8 0.75 100 0.9975393
## 0.4 2 0.8 0.75 150 0.9983368
## 0.4 2 0.8 1.00 50 0.9921394
## 0.4 2 0.8 1.00 100 0.9977216
## 0.4 2 0.8 1.00 150 0.9985190
## 0.4 3 0.6 0.50 50 0.9973115
## 0.4 3 0.6 0.50 100 0.9984279
## 0.4 3 0.6 0.50 150 0.9985646
## 0.4 3 0.6 0.75 50 0.9974254
## 0.4 3 0.6 0.75 100 0.9987013
## 0.4 3 0.6 0.75 150 0.9988380
## 0.4 3 0.6 1.00 50 0.9977444
## 0.4 3 0.6 1.00 100 0.9987241
## 0.4 3 0.6 1.00 150 0.9988608
## 0.4 3 0.8 0.50 50 0.9973798
## 0.4 3 0.8 0.50 100 0.9985646
## 0.4 3 0.8 0.50 150 0.9986329
## 0.4 3 0.8 0.75 50 0.9974482
## 0.4 3 0.8 0.75 100 0.9987013
## 0.4 3 0.8 0.75 150 0.9988380
## 0.4 3 0.8 1.00 50 0.9978583
## 0.4 3 0.8 1.00 100 0.9987241
## 0.4 3 0.8 1.00 150 0.9987697
## Kappa
## 0.9103509
## 0.9577945
## 0.9730326
## 0.9124812
## 0.9571178
## 0.9726817
## 0.9122807
## 0.9583709
## 0.9727569
## 0.9098997
## 0.9567669
## 0.9722556
## 0.9124060
## 0.9578697
## 0.9728321
## 0.9123308
## 0.9578446
## 0.9729825
## 0.9850376
## 0.9954637
## 0.9973434
## 0.9842105
## 0.9953133
## 0.9973935
## 0.9842105
## 0.9959649
## 0.9978446
## 0.9840602
## 0.9951629
## 0.9971429
## 0.9842356
## 0.9957644
## 0.9975940
## 0.9842356
## 0.9958897
## 0.9976190
## 0.9948120
## 0.9981955
## 0.9985965
## 0.9955890
## 0.9983459
## 0.9985464
## 0.9958396
## 0.9984712
## 0.9986967
## 0.9951128
## 0.9980702
## 0.9983459
## 0.9954386
## 0.9983960
## 0.9986216
## 0.9956892
## 0.9983208
## 0.9986967
## 0.9360902
## 0.9686216
## 0.9808271
## 0.9387218
## 0.9689975
## 0.9807769
## 0.9386466
## 0.9690977
## 0.9811028
## 0.9366917
## 0.9688221
## 0.9811529
## 0.9382456
## 0.9686717
## 0.9810777
## 0.9387469
## 0.9692982
## 0.9811779
## 0.9911028
## 0.9970175
## 0.9977694
## 0.9909023
## 0.9973183
## 0.9982206
## 0.9915539
## 0.9974687
## 0.9982957
## 0.9909524
## 0.9968922
## 0.9979699
## 0.9908521
## 0.9972932
## 0.9981704
## 0.9913534
## 0.9974937
## 0.9983709
## 0.9970426
## 0.9982707
## 0.9984211
## 0.9971679
## 0.9985714
## 0.9987218
## 0.9975188
## 0.9985965
## 0.9987469
## 0.9971178
## 0.9984211
## 0.9984962
## 0.9971930
## 0.9985714
## 0.9987218
## 0.9976441
## 0.9985965
## 0.9986466
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
## and subsample = 1.
proc.time()-startTimeModule
## user system elapsed
## 34947.968 131.044 17676.156
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2a2d45ba}"
results <- resamples(list(LDA=fit.lda, CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LDA, CART, BagCART, RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.8395990 0.8418774 0.8446115 0.8456140 0.8494532 0.8528139 0
## CART 0.3634085 0.4534062 0.4535202 0.4357257 0.4540328 0.4543176 0
## BagCART 0.9854181 0.9863295 0.9876965 0.9879471 0.9895762 0.9906585 0
## RF 0.9974937 0.9988608 0.9990886 0.9989747 0.9993165 0.9995443 0
## GBM 0.9977216 0.9988608 0.9988608 0.9988608 0.9990886 0.9995443 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.8235589 0.8260652 0.8290727 0.8301754 0.8343985 0.8380952 0
## CART 0.2997494 0.3987469 0.3988722 0.3792982 0.3994361 0.3997494 0
## BagCART 0.9839599 0.9849624 0.9864662 0.9867419 0.9885338 0.9897243 0
## RF 0.9972431 0.9987469 0.9989975 0.9988722 0.9992481 0.9994987 0
## GBM 0.9974937 0.9987469 0.9987469 0.9987469 0.9989975 0.9994987 0
dotplot(results)
cat('The average accuracy from all models is:',
mean(c(results$values$`LDA~Accuracy`,results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)),'\n')
## The average accuracy from all models is: 0.8534245
cat('Total training time for all models:',proc.time()-startModeling)
## Total training time for all models: 41292.75 167.555 24051.11 0 0
After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve that accuracy further.
Using the two best-performing algorithms from the previous section (Random Forest and Gradient Boosting), we will search for the combination of parameters for each algorithm that yields the best results.
Finally, we will compare the tuned models and see whether we can get more accuracy out of them.
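The tuning steps below rely on `expand.grid()` to enumerate the candidate parameter values; caret then fits one model per grid row. The two grids used in this section can be previewed in base R:

```r
# Grid for Random Forest: one fit per candidate mtry value.
grid_rf  <- expand.grid(mtry = c(2, 17, 33, 48))
nrow(grid_rf)   # 4 candidate models

# Grid for XGBoost: only nrounds varies; the other parameters are held fixed.
grid_xgb <- expand.grid(nrounds = c(100, 150, 200, 300), max_depth = 3,
                        eta = 0.4, gamma = 0, colsample_bytree = 0.6,
                        min_child_weight = 1, subsample = 1)
nrow(grid_xgb)  # 4 candidate models
```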
# Tuning algorithm #1 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@457e2f02}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(2, 17, 33, 48))
fit.final1 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)
print(fit.final1)
## Random Forest
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9990203 0.9989223
## 17 0.9954887 0.9950376
## 33 0.9918660 0.9910526
## 48 0.9899066 0.9888972
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 6759.321 8.359 6780.883
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@cb5822}"
# Tuning algorithm #2 - Gradient Boosting
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@28d25987}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100, 150, 200, 300), max_depth=3, eta=0.4, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
fit.final2 <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)
print(fit.final2)
## eXtreme Gradient Boosting
##
## 43890 samples
## 48 predictor
## 11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ...
## Resampling results across tuning parameters:
##
## nrounds Accuracy Kappa
## 100 0.9988152 0.9986967
## 150 0.9989975 0.9988972
## 200 0.9989747 0.9988722
## 300 0.9989975 0.9988972
##
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.4
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
## and subsample = 1.
proc.time()-startTimeModule
## user system elapsed
## 2062.975 3.140 1040.493
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@59f99ea}"
results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9977216 0.9988608 0.9993165 0.9990203 0.9993165 0.9995443 0
## GBM 0.9979494 0.9987469 0.9990886 0.9989975 0.9993165 0.9995443 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9974937 0.9987469 0.9992481 0.9989223 0.9992481 0.9994987 0
## GBM 0.9977444 0.9986216 0.9989975 0.9988972 0.9992481 0.9994987 0
dotplot(results)
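The dotplot compares the accuracy distributions visually; a paired test can check whether the small RF-vs-GBM gap is meaningful (with caret, `summary(diff(results))` performs such pairwise comparisons directly). The fold-wise accuracies below are hypothetical stand-ins with the same shape as the 10-fold results above:

```r
# Hypothetical per-fold accuracies for two models evaluated on the same folds.
rf_acc  <- c(0.9977, 0.9989, 0.9993, 0.9990, 0.9993,
             0.9995, 0.9991, 0.9988, 0.9992, 0.9994)
gbm_acc <- c(0.9979, 0.9987, 0.9991, 0.9990, 0.9993,
             0.9995, 0.9989, 0.9987, 0.9990, 0.9992)
# Because both models see identical splits, the folds are paired.
paired_p <- t.test(rf_acc, gbm_acc, paired = TRUE)$p.value
```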
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing the model may involve sub-tasks such as validating it against the hold-out test set, training a standalone model on the complete dataset, and saving the model to disk for later use.
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5c3bd550}"
predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11
## 1 1327 0 0 0 0 0 0 0 0 0 0
## 2 0 1327 0 0 0 0 0 0 4 2 0
## 3 0 0 1329 0 0 0 0 0 0 0 0
## 4 0 0 0 1329 0 0 0 0 0 0 0
## 5 0 0 0 0 1327 0 0 0 0 0 0
## 6 2 0 0 0 0 1328 0 0 0 0 0
## 7 0 0 0 0 0 0 1329 0 0 0 0
## 8 0 0 0 0 2 0 0 1329 1 0 0
## 9 0 0 0 0 0 1 0 0 1323 0 0
## 10 0 2 0 0 0 0 0 0 1 1327 0
## 11 0 0 0 0 0 0 0 0 0 0 1329
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.9983, 0.9994)
## No Information Rate : 0.0909
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9989
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.99850 0.99850 1.00000 1.00000 0.99850 0.99925
## Specificity 1.00000 0.99955 1.00000 1.00000 1.00000 0.99985
## Pos Pred Value 1.00000 0.99550 1.00000 1.00000 1.00000 0.99850
## Neg Pred Value 0.99985 0.99985 1.00000 1.00000 0.99985 0.99992
## Prevalence 0.09091 0.09091 0.09091 0.09091 0.09091 0.09091
## Detection Rate 0.09077 0.09077 0.09091 0.09091 0.09077 0.09084
## Detection Prevalence 0.09077 0.09118 0.09091 0.09091 0.09077 0.09098
## Balanced Accuracy 0.99925 0.99902 1.00000 1.00000 0.99925 0.99955
## Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity 1.00000 1.00000 0.99549 0.99850 1.00000
## Specificity 1.00000 0.99977 0.99992 0.99977 1.00000
## Pos Pred Value 1.00000 0.99775 0.99924 0.99774 1.00000
## Neg Pred Value 1.00000 1.00000 0.99955 0.99985 1.00000
## Prevalence 0.09091 0.09091 0.09091 0.09091 0.09091
## Detection Rate 0.09091 0.09091 0.09050 0.09077 0.09091
## Detection Prevalence 0.09091 0.09111 0.09057 0.09098 0.09091
## Balanced Accuracy 1.00000 0.99989 0.99771 0.99913 1.00000
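The overall Accuracy reported above is derived from the confusion matrix directly: correct predictions sit on the diagonal, so accuracy is the diagonal sum over the total count. A toy 2-class example (not the project's data) shows the arithmetic:

```r
# Rows are predictions, columns are reference classes, as in caret's output.
cm <- matrix(c(50,  2,   # predicted class 1: 50 correct, 2 wrong
                1, 47),  # predicted class 2: 1 wrong, 47 correct
             nrow = 2, byrow = TRUE)
acc <- sum(diag(cm)) / sum(cm)  # 0.97
```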
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11
## 1 1326 0 0 0 0 0 0 0 1 0 0
## 2 0 1327 0 0 0 0 0 0 1 1 0
## 3 0 0 1328 5 0 0 0 0 2 0 0
## 4 0 0 0 1324 0 0 0 0 0 0 0
## 5 0 0 0 0 1328 0 0 0 0 0 0
## 6 3 0 1 0 0 1329 0 0 0 0 0
## 7 0 0 0 0 0 0 1329 0 0 0 0
## 8 0 0 0 0 1 0 0 1329 0 0 0
## 9 0 0 0 0 0 0 0 0 1325 0 0
## 10 0 2 0 0 0 0 0 0 0 1328 0
## 11 0 0 0 0 0 0 0 0 0 0 1329
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9981, 0.9993)
## No Information Rate : 0.0909
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9987
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.99774 0.99850 0.99925 0.99624 0.99925 1.00000
## Specificity 0.99992 0.99985 0.99947 1.00000 1.00000 0.99970
## Pos Pred Value 0.99925 0.99850 0.99476 1.00000 1.00000 0.99700
## Neg Pred Value 0.99977 0.99985 0.99992 0.99962 0.99992 1.00000
## Prevalence 0.09091 0.09091 0.09091 0.09091 0.09091 0.09091
## Detection Rate 0.09070 0.09077 0.09084 0.09057 0.09084 0.09091
## Detection Prevalence 0.09077 0.09091 0.09132 0.09057 0.09084 0.09118
## Balanced Accuracy 0.99883 0.99917 0.99936 0.99812 0.99962 0.99985
## Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity 1.00000 1.00000 0.99699 0.99925 1.00000
## Specificity 1.00000 0.99992 1.00000 0.99985 1.00000
## Pos Pred Value 1.00000 0.99925 1.00000 0.99850 1.00000
## Neg Pred Value 1.00000 1.00000 0.99970 0.99992 1.00000
## Prevalence 0.09091 0.09091 0.09091 0.09091 0.09091
## Detection Rate 0.09091 0.09091 0.09064 0.09084 0.09091
## Detection Prevalence 0.09091 0.09098 0.09064 0.09098 0.09091
## Balanced Accuracy 1.00000 0.99996 0.99850 0.99955 1.00000
startTimeModule <- proc.time()
set.seed(seedNum)
# Combining datasets to form a complete dataset that will be used to train the final model
Xy_complete <- rbind(Xy_train, Xy_test)
# library(randomForest)
# finalModel <- randomForest(targetVar~., Xy_complete, mtry=2, na.action=na.omit)
# summary(finalModel)
proc.time()-startTimeModule
## user system elapsed
## 0.041 0.002 0.043
#saveRDS(finalModel, "./finalModel_MultiClass.rds")
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@39c0f4a}"
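The commented `saveRDS()` call above persists the final model for later reuse. A self-contained sketch of that save/reload round trip, using a small base-R model as a stand-in for `finalModel`:

```r
fit <- lm(mpg ~ wt, data = mtcars)    # placeholder model for illustration
path <- tempfile(fileext = ".rds")    # temporary file; the project writes
saveRDS(fit, path)                    # to "./finalModel_MultiClass.rds"
reloaded <- readRDS(path)             # restore it, e.g. in a later session
identical(coef(fit), coef(reloaded))  # TRUE: same fitted coefficients
```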
proc.time()-startTimeScript
## user system elapsed
## 50180.710 180.162 31956.413